A Comparison of Event Models for Naive Bayes Text Classification
Authors
Abstract
Recent approaches to text classification have used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption. Some use a multi-variate Bernoulli model, that is, a Bayesian network with no dependencies between words and binary word features (e.g., Larkey and Croft; Koller and Sahami). Others use a multinomial model, that is, a uni-gram language model with integer word counts (e.g., Lewis and Gale; Mitchell). This paper aims to clarify the confusion by describing the differences and details of these two models, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial usually performs even better at larger vocabulary sizes, providing on average a reduction in error over the multi-variate Bernoulli model at any vocabulary size.
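The difference between the two event models can be made concrete with a minimal sketch. It is not the authors' implementation: the function names, the toy corpus, and the choice of add-one (Laplace) smoothing are all assumptions made for illustration. The Bernoulli scorer looks only at whether each vocabulary word is present or absent, while the multinomial scorer multiplies in one factor per word occurrence. The multinomial coefficient and document-length term are omitted because they do not depend on the class.

```python
import math
from collections import Counter


def train(docs, labels, vocab):
    """Estimate class priors and per-class word parameters for both event models."""
    classes = sorted(set(labels))
    n_docs = Counter(labels)
    prior = {c: n_docs[c] / len(docs) for c in classes}
    doc_freq = {c: Counter() for c in classes}    # number of documents containing w
    word_count = {c: Counter() for c in classes}  # total occurrences of w
    for tokens, c in zip(docs, labels):
        in_vocab = [w for w in tokens if w in vocab]
        doc_freq[c].update(set(in_vocab))
        word_count[c].update(in_vocab)
    # Bernoulli: P(w present | c), add-one smoothing (an assumption of this sketch).
    bern = {c: {w: (1 + doc_freq[c][w]) / (2 + n_docs[c]) for w in vocab}
            for c in classes}
    # Multinomial: P(w | c) over word occurrences, add-one smoothing.
    total = {c: sum(word_count[c].values()) for c in classes}
    multi = {c: {w: (1 + word_count[c][w]) / (len(vocab) + total[c]) for w in vocab}
             for c in classes}
    return prior, bern, multi


def log_score_bernoulli(tokens, c, prior, bern):
    """Binary features: every vocabulary word contributes, present or absent."""
    present = set(tokens)
    score = math.log(prior[c])
    for w, p in bern[c].items():
        score += math.log(p) if w in present else math.log(1.0 - p)
    return score


def log_score_multinomial(tokens, c, prior, multi):
    """Integer counts: each occurrence of an in-vocabulary word contributes once."""
    score = math.log(prior[c])
    for w in tokens:
        if w in multi[c]:
            score += math.log(multi[c][w])
    return score


def classify(tokens, prior, params, scorer):
    """Return the class with the highest log score under the given event model."""
    return max(prior, key=lambda c: scorer(tokens, c, prior, params))


# Hypothetical toy corpus, for illustration only.
docs = [["ball", "goal", "goal"], ["ball", "team"],
        ["vote", "law"], ["vote", "vote", "team"]]
labels = ["sport", "sport", "politics", "politics"]
vocab = {"ball", "goal", "team", "vote", "law"}

prior, bern, multi = train(docs, labels, vocab)
query = ["goal", "goal", "ball"]
bernoulli_label = classify(query, prior, bern, log_score_bernoulli)
multinomial_label = classify(query, prior, multi, log_score_multinomial)
```

Note that repeating "goal" in the query changes the multinomial score but leaves the Bernoulli score untouched, which is the distinction the abstract draws between integer word counts and binary word features.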
Similar resources
A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
With the rapid growth in the number of documents, Text Document Classification (TDC) methods have become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and a Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the large feature space in TDC. TDC includes different actions such as text processing, feature extraction, form...
A Term Association Translation Model for Naive Bayes Text Classification
Text classification (TC) has long been an important research topic in information retrieval (IR) related areas. In the literature, the bag-of-words (BoW) model has been widely used to represent a document in text classification and many other applications. However, BoW, which ignores the relationships between terms, offers a rather poor document representation. Some previous research has shown tha...
Naive Bayes as a Satisficing Model
We report on an empirical study of supervised learning algorithms that induce models to resolve the meaning of ambiguous words in text. We find that the Naive Bayesian classifier is as accurate as several more sophisticated methods. This is a surprising result since Naive Bayes makes simplifying assumptions about disambiguation that are not realistic. However, our results correspond to a growing b...
Athena: Mining-based Interactive Management of Text Databases
We describe Athena: a system for creating, exploiting, and maintaining a hierarchical arrangement of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurat...
Athena: Mining-Based Interactive Management of Text Database
We describe Athena: a system for creating, exploiting, and maintaining a hierarchy of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models. Naive ...
Publication date: 2003